NFL

Homework
Author

Abby Hofschneider

Published

December 12, 2023

1 Question 2

1.1 Summary Statistics

rmarkdown::paged_table(NFL2022_stuffs)
skim(NFL2022_stuffs) %>% 
  select(-n_missing)
Data summary
Name NFL2022_stuffs
Number of rows 50147
Number of columns 10
_______________________
Column type frequency:
character 2
numeric 8
________________________
Group variables None

Variable type: character

skim_variable complete_rate min max empty n_unique whitespace
game_id 1.00 13 15 0 284 0
posteam 0.93 2 3 0 32 0

Variable type: numeric

skim_variable complete_rate mean sd p0 p25 p50 p75 p100 hist
play_id 1.00 2057.86 1194.22 1 1039.00 2034.00 3065.50 5523 ▇▇▇▅▁
drive 0.99 11.48 6.59 1 6.00 11.00 17.00 35 ▇▇▇▂▁
week 1.00 9.91 5.61 1 5.00 10.00 15.00 22 ▇▆▆▆▃
qtr 1.00 2.58 1.14 1 2.00 3.00 4.00 5 ▆▇▆▇▁
down 0.83 2.00 1.00 1 1.00 2.00 3.00 4 ▇▆▁▃▂
half_seconds_remaining 1.00 796.94 564.41 0 255.00 774.00 1285.00 1800 ▇▅▅▅▅
pass 1.00 0.45 0.50 0 0.00 0.00 1.00 1 ▇▁▁▁▆
wp 0.99 0.51 0.29 0 0.29 0.52 0.73 1 ▆▆▇▆▆

1.2 Q2a

In data.frame, NFL2022_stuffs, remove observations for which values of posteam is missing

NFL2022_stuffs <- na.omit(NFL2022_stuffs[!is.na(NFL2022_stuffs$posteam), ])

rmarkdown::paged_table(NFL2022_stuffs)

1.3 Q2b

Summarize the mean value of pass for each posteam when all of the following conditions hold: 1. wp is greater than 20% and less than 75% 2. down is less than or equal to 2 3. half_seconds_remaining is greater than 120

filtered_data <- NFL2022_stuffs[NFL2022_stuffs$wp > 0.20 & NFL2022_stuffs$wp < 0.75 &
                                  NFL2022_stuffs$down <= 2 &
                                  NFL2022_stuffs$half_seconds_remaining > 120, ]
mean_pass_by_posteam <- aggregate(filtered_data$pass, by=list(filtered_data$posteam), FUN=mean)
colnames(mean_pass_by_posteam) <- c("posteam", "mean_pass")
print(mean_pass_by_posteam)
   posteam mean_pass
1      ARI 0.5528455
2      ATL 0.4000000
3      BAL 0.5198330
4      BUF 0.6043956
5      CAR 0.4578947
6      CHI 0.4198312
7      CIN 0.6567460
8      CLE 0.4908722
9      DAL 0.4742647
10     DEN 0.4930796
11     DET 0.4906542
12      GB 0.5088496
13     HOU 0.4793388
14     IND 0.4938525
15     JAX 0.5207921
16      KC 0.6376068
17      LA 0.5104895
18     LAC 0.6076190
19      LV 0.4921569
20     MIA 0.5334646
21     MIN 0.5555556
22      NE 0.5208333
23      NO 0.4214464
24     NYG 0.5153846
25     NYJ 0.5061728
26     PHI 0.5801217
27     PIT 0.4796296
28     SEA 0.5662188
29      SF 0.4805726
30      TB 0.5529412
31     TEN 0.4342723
32     WAS 0.4054581

1.4 Q2c

Provide both (1) a ggplot code with geom_point() using the resulting data.frame in Q2b and (2) a simple comments to describe the mean value of pass for each posteam. In the ggplot, reorder the posteam categories based on the mean value of pass in ascending or in descending order

library(ggplot2)

mean_pass_by_posteam$posteam <- factor(mean_pass_by_posteam$posteam, 
                                       levels = mean_pass_by_posteam$posteam[order(mean_pass_by_posteam$mean_pass)])


ggplot(mean_pass_by_posteam, aes(x = mean_pass, y = posteam)) +
  geom_point() +
  labs(x = "Percent of Pass Plays", y = "Team with Possession", title = "Team vs Percent of Pass Plays") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

1.5 Q2d

Consider the following data.frame:

NFL2022_epa <- read_csv('https://bcdanl.github.io/data/NFL2022_epa.csv')
rmarkdown::paged_table(NFL2022_epa)

Create the data.frame, NFL2022_stuffs_EPA, that includes:

  1. All the variables in the data.frame, NFL2022_stuffs
  2. The variables, passer, receiver, and epa, from the data.frame, NFL2022_epa by joining the two data.frames

In the resulting data.frame, NFL2022_stuffs_EPA, remove observations with NA in passer

NFL2022_epa <- read_csv('https://bcdanl.github.io/data/NFL2022_epa.csv')
NFL2022_stuffs_EPA <- merge(NFL2022_stuffs, NFL2022_epa[, c("game_id", "passer", "receiver", "epa")], by = "game_id")
NFL2022_stuffs_EPA <- NFL2022_stuffs_EPA[complete.cases(NFL2022_stuffs_EPA$passer), ]

rmarkdown::paged_table(NFL2022_stuffs_EPA)

1.6 Q2e

Provide both (1) a single ggplot and (2) a simple comment to describe the NFL weekly trend of weekly mean value of epa for each of the following two passers: 1. "J.Allen" 2. "P.Mahomes"

library(ggplot2)

selected_passers <- c("J.Allen", "P.Mahomes")
filtered_data <- NFL2022_stuffs_EPA[NFL2022_stuffs_EPA$passer %in% selected_passers, ]
filtered_data$week <- factor(filtered_data$week, levels = unique(filtered_data$week))
weekly_mean_epa <- aggregate(epa ~ week + passer, data = filtered_data, FUN = mean)

ggplot(weekly_mean_epa, aes(x = week, y = epa, color = passer, group = passer)) +
  geom_line() +
  labs(x = "Week", y = "Weekly Mean EPA", title = "NFL Weekly Trend of Mean EPA for J.Allen and P.Mahomes") +
  theme_minimal()

Patrick Mahomes generally has a higher weekly mean epa than Josh Allen

1.7 Q2f

Calculate the difference between the mean value of epa for "J.Allen" the mean value of epa for "P.Mahomes" for each value of week.

selected_passers <- c("J.Allen", "P.Mahomes")
filtered_data <- NFL2022_stuffs_EPA[NFL2022_stuffs_EPA$passer %in% selected_passers, ]
mean_epa_by_week <- aggregate(epa ~ week + passer, data = filtered_data, FUN = mean)
epa_diff <- reshape(mean_epa_by_week, idvar = "week", timevar = "passer", direction = "wide")
epa_diff$epa_diff <- epa_diff$epa.J.Allen - epa_diff$epa.P.Mahomes
print(epa_diff)
   week epa.J.Allen epa.P.Mahomes    epa_diff
1     1  0.52963415    0.69840404 -0.16876989
2     2  0.48691617    0.14841216  0.33850401
3     3  0.16932725    0.24559401 -0.07626677
4     4  0.19104682    0.27137549 -0.08032867
5     5  0.62742248    0.30228470  0.32513777
6     6  0.30652151    0.13313721  0.17338430
7     8  0.22419910            NA          NA
8     9 -0.20799939    0.09646711 -0.30446651
9    10  0.16051785    0.58904325 -0.42852541
10   11  0.19206366    0.36503570 -0.17297205
11   12  0.09828258    0.24726968 -0.14898710
12   13  0.33021344    0.20622354  0.12398990
13   14 -0.06207961    0.13106472 -0.19314433
14   15  0.25693067    0.32195856 -0.06502788
15   16  0.02143551    0.12156763 -0.10013212
16   18  0.20865931    0.17297609  0.03568322
17   19 -0.20950326            NA          NA
18   20 -0.04289048    0.27933023 -0.32222071
25    7          NA    0.70130690          NA
34   17          NA    0.19847047          NA
37   21          NA    0.19610416          NA
38   22          NA    0.55937371          NA

1.8 Q2g

Summarize the resulting data.frame in Q2d, with the following four variables:

  1. posteam: String abbreviation for the team with possession.
  2. passer: Name of the player who passed a ball to a receiver by initially taking a three-step drop, and backpedaling into the pocket to make a pass. (Mostly, they are quarterbacks.)
  3. mean_epa: Mean value of epa in 2022 for each passer
  4. n_pass: Number of observations for each passer

Then find the top 10 NFL passers in 2022 in terms of the mean value of epa, conditioning that n_pass must be greater than or equal to the third quantile level of n_pass

library(dplyr)
summary_data <- NFL2022_stuffs_EPA %>%
  group_by(posteam, passer) %>%
  summarise(mean_epa = mean(epa),
            n_pass = n())
third_quantile <- quantile(summary_data$n_pass, 0.75)
top_passers <- summary_data %>%
  filter(n_pass >= third_quantile) %>%
  top_n(10, mean_epa) %>%
  arrange(desc(mean_epa))
print(top_passers)
# A tibble: 183 × 4
# Groups:   posteam [32]
   posteam passer      mean_epa n_pass
   <chr>   <chr>          <dbl>  <int>
 1 CHI     K.Cousins      0.474   3676
 2 CHI     J.Goff         0.473   4036
 3 SEA     B.Purdy        0.466   4736
 4 LA      J.Garoppolo    0.458   4088
 5 JAX     P.Mahomes      0.456   5237
 6 BUF     J.Brissett     0.442   3744
 7 CAR     J.Goff         0.359   3900
 8 IND     T.Lawrence     0.326   4428
 9 TEN     T.Lawrence     0.290   6042
10 MIN     J.Goff         0.288   6299
# ℹ 173 more rows